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Abstract 

Software  Dynamic  Translation  (SDT)  systems  have  been  used  for 
program  instrumentation,  dynamic  optimization,  security  policy 
enforcement,  intrusion  detection,  and  many  other  uses.  To  be 
widely  applicable,  the  overhead  (runtime,  memory  usage,  and 
power  consumption)  should  be  as  low  as  possible.  For  instance,  if 
an  SDT  system  is  protecting  a  web  server  against  possible  attacks, 
but  causes  30%  slowdown,  a  company  may  need  30%  more 
machines  to  handle  the  web  traffic  they  expect.  Consequently,  the 
causes  of  SDT  overhead  should  be  studied  rigorously. 

This  work  evaluates  many  alternative  policies  for  the  creation  of 
fragments  within  the  Strata  SDT  framework.  In  particular,  we 
examine  the  effects  of  ending  translation  at  conditional  branches; 
ending  translation  at  unconditional  branches;  whether  to  use  partial 
inlining  for  call  instructions;  whether  to  build  the  target  of  calls 
immediately  or  lazily;  whether  to  align  branch  targets;  and  how  to 
place  code  to  transition  back  to  the  dynamic  translator.  We  find  that 
effective  translation  strategies  are  vital  to  program  performance, 
improving  performance  from  as  much  as  28%  overhead,  to  as  little 
as  3%  overhead  on  average  for  the  SPEC  CPU2000  benchmark 
suite.  We  further  demonstrate  that  these  translation  strategies  are 
effective  across  several  platforms,  including  Sun  SPARC  UltraS- 
parc  Hi,  AMD  Athlon  Opteron,  and  Intel  Pentium  IV  processors. 
Categories  and  Subject  Descriptors  B.m  [Hardware]:  Miscella¬ 
neous,  C.4  [Performance  of  Systems]:  Performance  attributes, 
D.3.0  [Programming  Languages]:  General,  D.3.4  [Program¬ 
ming  Languages]:  Processors  -  Code  Generation,  Compilers, 
Incremental  Compilers,  Interpreters,  Optimization,  Run-time  Envi¬ 
ronments 

General  Terms  Algorithms,  Measurement,  Performance,  Design, 
Experimentation. 
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1.  Introduction 

Over  the  years  software  dynamic  translation  (SDT) — the  program¬ 
matic  modification  of  a  running  program’s  binary  instructions — 
has  become  an  increasingly  useful  technique  in  the  system  imple¬ 
mentor’s  repertoire.  A  wide  variety  of  systems  can  be  classified  as 
software  dynamic  translators,  including  dynamic  optimizers, 
dynamic  binary  translators,  dynamic  instrumentation  systems, 
dynamic  software  updaters,  and  certain  emulators  and  simulators. 

SDT  affords  system  designers  unprecedented  flexibility  in  control¬ 
ling  and  modifying  a  program’s  execution.  This  flexibility  allows 
software  dynamic  translation  to  be  used  to  accomplish  several 
objectives  not  easily  achieved  via  other  means.  For  instance,  SDT 
may  be  used  to  overcome  the  barriers  to  entry  associated  with  the 
introduction  of  a  new  OS  or  CPU  architecture.  Transmeta’s  Code 
Morphing  technology  is  used  for  this  very  purpose;  i.e.,  allowing 
unmodified  Intel  IA-32  binaries  to  run  on  the  low-power,  VLIW 
Crusoe  processor  [7],  Similarly,  the  UQDBT  system  dynamically 
translates  Intel  IA-32  binaries  to  run  on  SPARC-based  processors 
[23],  and  FX!32  dynamically  translates  x86  binaries  to  run  on 
Alpha  processors  [5], 

In  addition  to  allowing  designers  to  overcome  cost  barriers  to  new 
platform  acceptance,  the  flexibility  of  SDT  has  proven  useful  for 
other  purposes.  For  instance,  Shade  uses  SDT  to  implement  high- 
performance  instruction  set  simulators  [6].  Embra  uses  SDT  to 
implement  a  high-performance  operating  system  emulator  [24], 
Dynamo  and  Mojo  use  SDT  to  improve  the  performance  of  native 
binaries  [1,  4],  and  DAISY  uses  software  dynamic  translation  to 
evaluate  the  performance  of  novel  VLIW  architectures  and  accom¬ 
panying  optimization  techniques  [9].  The  Keminst  [22,  21]  and 
Vulcan  [20]  systems  use  SDT  to  insert  program  monitoring  instru¬ 
mentation  into  running  programs.  More  recently  SDT  has  been 
used  to  ensure  safe  execution  of  untrusted  binaries  [11,  14,  15,  17]. 

Despite  the  many  applications  of  SDT  and  the  lively  state  of 
research  into  novel  uses  of  SDT,  one  major  obstacle  for  more  per¬ 
vasive  use  of  dynamic  translators  is  the  cost  of  applying  transfor¬ 
mations  at  runtime.  When  an  SDT  system  adds  too  much  overhead, 
in  the  form  of  execution  time,  memory  requirements,  disk  require¬ 
ments,  or  network  traffic,  the  system  is  less  likely  to  be  used.  For 
instance,  an  SDT  system  may  be  used  to  protect  a  web  server  from 
malicious  attackers  trying  to  gain  access  to  the  system.  However,  if 
the  SDT  system  causes  the  web  server  to  run  30%  slower  than 
without  the  SDT  system,  a  large  web-hosting  firm  may  need  up  to 
30%  more  CPUs  to  handle  the  same  amount  of  traffic  during  peri¬ 
ods  of  peak  demand.  This  increase  corresponds  to  a  30%  increase 
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in  initial  hardware  costs,  but  also  up  to  a  30%  increase  in  peak 
electricity  usage,  a  30%  increase  in  space  required  to  store  the 
devices,  and  up  to  a  30%  increase  in  peak  cooling  costs  for  the 
building!  With  such  significantly  increased  cost,  the  SDT  system 
may  be  an  undesirable  solution.  Instead,  if  the  SDT  system  intro¬ 
duces  only  1%  overhead,  it  may  well  be  worth  the  extra  hardware 
and  utility  costs  to  prevent  malicious  attacks.  Consequently,  we  see 
that  reducing  SDT  overhead  is  of  utmost  importance  for  the  perva¬ 
sive  use  of  SDT  technology. 

The  causes  of  overhead  in  SDT  systems  need  to  be  thoroughly  and 
rigorously  evaluated.  A  fundamental  operation  done  by  an  SDT 
system  is  to  translate  and  cache  code  fragments  (the  basic  code 
granularity  of  translation).  To  this  end,  we  examine  a  variety  of 
fragment  construction  policies,  their  possible  benefits  and  draw¬ 
backs.  In  particular,  we  study  and  evaluate  some  of  the  design  deci¬ 
sions  that  were  made  within  Strata  [14],  a  publicly  available, 
retargetable  SDT  system  supporting  three  instruction  sets:  IA-32, 
SPARC,  and  MIPS. 

We  find  that  several  of  Strata’s  initial  design  decisions  were  ineffi¬ 
cient  and  caused  overhead  in  several  ways.  Some  decisions  caused 
extra  instructions  to  be  executed,  while  other  decisions  hurt  the 
performance  of  the  processor  and  memory  hierarchy,  including 
branch  prediction  accuracy  and  cache  locality. 

The  major  contributions  of  this  paper  are: 

•  It  describes  several  decisions  that  must  be  made  when  build¬ 
ing  a  dynamic  translation  system  for  low  overhead. 

•  The  paper  empirically  determines  the  best  set  of  fragment  cre¬ 
ation  policies  to  use  when  building  a  low  overhead  SDT  sys¬ 
tem. 

•  The  paper  demonstrates  significant  overhead  reduction  on 
three  machines,  reducing  overhead  to  as  little  as  3%  on  aver¬ 
age  for  an  AMD  Athlon  Opteron  244. 

•  This  work  presents  evidence  that  the  fragment  creation  tech¬ 
niques  described  are  efficient  at  reducing  overhead  across  a 
variety  of  architectures,  not  just  a  single  architecture  used  for 
initial  experimentation. 

•  This  work  explains  the  remaining  causes  of  overhead  that  we 
were  unable  to  remove. 

•  Finally,  it  puts  forth  the  notion  that  the  causes  of  overhead  can 
be  due  to  inhibiting  the  underlying  hardware  from  peak  per¬ 
formance  rather  than  due  to  increases  in  instruction  counts  or 
time  spent  within  the  SDT  system. 

The  remainder  of  this  paper  is  organized  as  follows:  Section  2 
describes  Strata,  the  SDT  system  used  in  this  work.  Section  3  pre¬ 
sents  our  findings  using  experiments  designed  to  find  the  lowest 
overhead  option  of  many  different  translations  strategies.  Section  4 
highlights  other  research  to  reduce  SDT  overhead,  and  finally  Sec¬ 
tion  5  summarizes  our  findings. 

2.  Strata 

Strata  is  an  SDT  system  designed  for  high  retargetability  and  low 
overhead  translation.  Strata  has  been  used  for  a  variety  of  applica¬ 
tions  including  system  call  monitoring,  dynamic  download  of  code 
from  a  server,  and  enforcing  security  policies  [12,  25],  This  section 
describes  some  of  the  basic  features  of  Strata  which  are  important 
to  understanding  the  experiments  presented  later.  For  an  in  depth 


discussion  of  Strata,  please  refer  to  previous  publications  [14,  16, 

17]. 

2.1  Overview 

Strata  operates  as  a  co-routine  with  the  program  binary  it  is  trans¬ 
lating,  as  shown  in  Figure  1.  As  the  figure  shows,  each  time  Strata 
encounters  a  new  instruction  address  (i.e.,  PC),  it  first  checks  to  see 
if  the  address  has  been  translated  into  the  fragment  cache.  The 
fragment  cache  is  a  software  instruction  cache  that  stores  portions 
of  code  that  have  been  translated  from  the  native  binary.  The  frag¬ 
ment  cache  is  made  up  of  fragments,  which  are  the  basic  unit  of 
translation.  If  Strata  finds  that  a  requested  PC  has  not  been  previ¬ 
ously  translated,  Strata  allocates  a  fragment  and  begins  translation. 
Once  a  termination  condition  is  met,  Strata  emits  any  trampolines 
that  are  necessary.  Trampolines  are  pieces  of  code  emitted  into  the 
fragment  cache  to  transfer  control  back  to  Strata.  Most  control 
transfer  instructions  (CTIs)  are  initially  linked  to  trampolines 
(unless  its  target  previously  exists  in  the  fragment  cache).  Once  a 
CTI’s  target  instruction  becomes  available  in  the  fragment  cache, 
the  CTI  is  linked  directly  to  the  destination,  avoiding  future  uses  of 
the  trampoline.  This  mechanism  is  called  Fragment  Linking  and 
avoids  significant  overhead  associated  with  returning  to  Strata  after 
every  fragment  [14], 

Strata’s  translation  process  can  be  overriden  to  implement  a  new 
SDT  use.  The  basic  Strata  system  includes  several  default  behav¬ 
iors  that  control  the  creation  of  code  fragments.  These  default  deci¬ 
sions  can  be  changed  based  on  a  particular  SDT  use.  This  paper 
evaluates  which  choices  are  indeed  the  best  as  the  default. 


Figure  1:  High-level  overview  of  how  Strata  operates. 


2.2  Indirect  Branches 

Indirect  branches1  require  special  handling.  Since  indirect  branch 
targets  can  change  during  each  execution  of  the  instruction,  they 
cannot  be  linked  at  translation-time  to  their  target  in  the  fragment 
cache.  To  handle  indirect  branch  targets  efficiently,  Strata  emits  a 
sequence  of  instructions  that  perform  a  table  lookup.  The  table 
lookup  maps  application  PC’s  to  their  fragment  PC  equivalents.  On 


1.  For  this  work,  indirect  calls  are  considered  a  special  type  of  indirect 
branch. 
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a  table  miss,  control  is  transferred  back  to  Strata  which  builds  the 
requested  fragment  if  necessary,  updates  the  lookup  table  and 
finally  transfers  control  to  the  destination  fragment. 

Return  instructions  are  a  special  kind  of  indirect  branch  that  Strata 
optimizes  via  a  method  called  Fast  Returns.  Instead  of  emitting  the 
nonnal  sequence  of  instructions  to  do  a  table  lookup,  it  directly 
emits  the  return  instruction.  Normally  this  would  cause  the  transla¬ 
tor  to  lose  control  of  the  running  application  after  the  first  return 
instruction  is  executed,  since  control  would  transfer  back  to  an 
application  PC.  Strata  prevents  this  by  translating  call  instructions 
specially.  Call  instructions  normally  write  their  return  address  into 
a  specified  location.  Strata  translates  these  instructions  to  write  the 
corresponding  fragment  cache  return  address.  Such  translation 
causes  the  return  instruction  to  return  to  another  fragment  within 
the  fragment  cache,  and  Strata  can  consequently  maintain  control. 
We  choose  to  enable  this  option  as  it  provides  for  an  aggressive 
baseline  in  which  to  attempt  to  remove  more  overhead.  Please  see 
previous  publications  for  further  detail  about  indirect  branch  han¬ 
dling  mechanisms  [16]. 

3.  Fragment  Translation  Policies 

This  section  describes  the  translation  strategies  used  within  Strata, 
and  experimental  evidence  for  changing  some  of  Strata’s  initial 
design  decisions.  Section  3.1  first  describes  the  setup  used  for 
these  experiments. 

3.1  Experimental  Setup 

All  perfonnance  experiments  in  Sections  3. 2-3 .7  are  perfonned  on 
an  AMD  Athlon  Opteron  244  (1.8GHz)  running  Fedora  Core  4, 
with  Linux  kernel  2.6.11.  The  kernel  and  all  software  is  built  using 
the  IA-32  instruction  set.  Benchmarks  are  compiled  with  the  GNU 
gcc  compiler  and  the  options  -03  -fomit-frame-pointer  with 
dynamic  linking.  The  entire  set  of  SPEC  CPU2000  benchmarks  are 
used  for  performance  analysis.  Some  graphs  presented  use  the 
PAPI  statistic  collection  mechanisms  to  gather  information  about 
processor  events  such  as  instruction  cache  misses  via  hardware 
counters  [2].1  Graphs  are  normalized  to  native  execution  time 
when  possible.  In  these  graphs,  smaller  numbers  represent  better 
perfonnance.  The  fragment  cache  size  is  set  to  255MB,  so  that 
removing  blocks  from  the  fragment  cache  is  never  necessary.  How¬ 
ever,  only  1  benchmark  exceeded  10MB  of  fragments  in  the  frag¬ 
ment  cache  with  the  optimized  configuration. 

3.2  Normal  Instructions 

Normal  instructions  are  simple  instructions  such  as  add  or  mov  that 
cause  no  control  transfer.  Such  instructions  are  copied  as-is  from 
the  application  text  into  the  fragment  cache  and  no  special  han¬ 
dling  is  needed.  Consequently,  translation  of  these  instructions 
leaves  little  chance  that  overhead  is  created,  and  no  design  alterna¬ 
tives  to  evaluate. 

3.3  Direct,  Conditional  CTIs 

Direct,  conditional  CTIs,  such  as  beq  LI  or  jecxz  L2,  require  trans¬ 
lating  the  target  address  from  an  application  address  to  a  fragment 


1 .  These  graphs,  due  to  software  integration  issues,  are  collected  from 
a  Intel  Pentium  IV  Xeon  machine,  running  Red  Hat  Linux  7.3  2.96- 
110,  with  kernel  version  2.6. 1  lperfctr-2.6. 


cache  address.  They  are  also  good  points  to  consider  ending  frag¬ 
ment  construction.  To  understand  why,  consider  that  CTIs  some¬ 
times  have  a  highly-biased  outcome,  they  are  either  taken  or  not 
taken  for  the  entire  run  of  the  program.  Branches  that  detect  error 
conditions  are  often  of  this  fonn.  If  the  translator  were  to  continue 
translating  code  at  either  the  fall-through  or  target  of  the  branch, 
then  the  translator  is  perfonning  extra  work  and  creating  extra 
pressure  in  the  fragment  cache  and  hardware  instruction  cache.  On 
the  other  hand,  if  the  translator  stops  translating  at  the  branch  and  it 
is  not  highly  biased,  extra  instructions  will  be  needed  along  the 
fall-through  path  to  link  the  branch’s  fall-through  to  the  fragment 
for  not  taken  code. 

It  is  possible  to  conditionally  stop  the  fragment  build  1 )  if  the  tar¬ 
get  of  the  branch  already  exists  in  the  fragment  cache;  2)  if  the  fall- 
through  of  the  branch  already  exists  in  the  fragment  cache;  3)  if 
both  the  fall-through  and  the  target  exist  in  the  fragment  cache;  or 
4)  if  either  the  target  or  the  fall-through  exist  in  the  fragment  cache. 

Figure  2  shows  the  results  when  each  of  these  possibilities  is 
implemented  within  Strata.2  The  results  indicate  that  always  end¬ 
ing  the  fragment  at  a  conditional  branch  generally  has  a  significant 
perfonnance  penalty,  especially  in  the  integer  benchmarks  which 
are  less  memory  bound  and  more  control-intensive.  Perfonnance  is 
generally  improved  over  Strata’s  baseline  (always  stop)  by  condi¬ 
tionally  ending  the  fragment,  but  the  best  performance  comes  from 
unconditionally  continuing  to  build  the  fragment.  Over  5% 
improvement  can  be  seen  on  average.  The  only  other  fragment 
building  mechanism  that  comes  close  is  ending  if  both  the  fall- 
through  and  the  target  previously  exist  in  the  fragment  cache.  This 
mechanism’s  perfonnance  is  close  to  unconditionally  continuing 
because  the  stopping  condition  is  rarely  met. 

Consequently,  we  believe  that  never  ending  a  fragment  at  a  condi¬ 
tional  branch  is  the  best  way  to  reduce  overhead.  All  further  exper¬ 
iments  will  not  end  fragment  building  at  conditional  branches. 

3.4  Handling  Direct,  Unconditional  CTIs 

Unconditional,  direct  CTIs,  such  as  jmp  LI,  present  an  opportunity 
for  the  translator  to  generate  better  code  by  eliminating  the  jump 
instruction  and  continuing  translation  at  the  destination  of  the 
jump,  or  eliding  the  instruction.  Sometimes  eliminating  the  CTI 
helps  program  performance  by  reducing  instruction  count.  In  other 
cases,  eliminating  jumps  causes  code  to  be  duplicated  in  the  frag¬ 
ment  cache,  which  in  turn  impacts  instruction  cache  performance. 

Figure  3  compares  perfonnance  when  Strata  eliminates  jump 
instructions  with  emitting  unconditional  jump  instructions  into  the 
fragment  cache  (when  the  target  fragment  already  exists  in  the 
fragment  cache).  The  figure  shows  that  some  benchmarks  benefit 
slightly  from  the  reduction  in  instruction  count  of  eliminating  the 
branch  instructions.  Other  benchmarks,  however,  have  a  dramatic 
degradation  due  to  excessive  code  duplication.  On  average,  keep¬ 
ing  the  jump  instructions  yield  a  0.5%  improvement.  Thus,  we  con- 


2.  Reported  SPEC  results  represent  three  runs  of  SPEC.  Average,  int 
ave,  and  fp  ave  are  the  geometric  mean  of  the  SPEC  numbers 
reported  for  all  benchmarks,  integer  benchmark,  and  floating-point 
benchmarks,  respectively.  Performance  results  may  vary  from  previ¬ 
ous  publications  due  to  bug  fixes,  upgrades  to  machines  and  differ¬ 
ent  operating  systems  and  architectures. 
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Benchmark 

□  always  stop  □  stop  if  targ  exists  ■  stop  if fallthrough  exists 

□  stop  if  either  exist  □  stop  if  both  exist  n  always  continue 


Figure  2:  Performance  of  Strata  as  the  fragment  termination  condition  of  conditional  CTIs  is  changed. 
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Figure  3:  Performance  of  Strata  as  the  fragment  termination  condi¬ 
tion  of  unconditional  CTIs  is  changed. 

elude  that  keeping  the  jump  instructions  is  wise,  and  will  use  that 
policy  for  all  experiments  in  the  remainder  of  the  paper. 

3.5  Direct,  Call  Instructions 

Direct  call  instructions,  while  not  as  common  as  direct  branches, 
are  still  an  integral  part  of  most  programs.  Programs  written  in 
object-oriented  languages  promote  the  use  of  the  direct  calls  to 
invoke  the  many  small  functions  often  found  in  these  types  of  pro¬ 
grams.  Consequently,  it  is  important  for  a  dynamic  translator  to 
handle  direct  call  instructions  efficiently.  Sections  3.5. 1-3. 5. 2  dis¬ 
cuss  alternatives  when  translating  direct  call  instructions. 


3.5.1  Partial  Inlining 

Strata  (and  other  SDT  systems)  often  implement  partial  inlining. 
Partial  inlining  refers  to  translating  a  call  instruction  into  a  write  of 
the  return  address  (a  push  instruction  in  an  IA-32  instruction  set) 
and  continuing  instruction  translation  at  the  destination  of  the  call 
instruction.  Partial  inlining  was  believed  to  help  performance  by 
eliminating  unnecessary  CTIs  and  reducing  instruction  count. 

Figure  4  shows  the  performance  of  applications  run  under  Strata 
with  and  without  partial  inlining  enabled.  As  the  figure  shows,  par¬ 
tial  inlining  is  actually  a  loss  for  most  programs.  In  fact,  the  perfor¬ 
mance  loss  is  over  14%  on  average  for  the  integer  benchmarks,  and 
over  30%  on  two  benchmarks!  We  believe  that  the  primary  reason 
for  the  loss  of  performance  is  that  the  branch  predictor’s  return 
address  stack  is  not  yielding  proper  predictions  because  call 
instructions  are  eliminated  when  partial  inlining  is  used.  There  are 
also  negative  affects  from  code  duplication.  Figure  5  shows  that 
partial  inlining  results  in  more  branch  mispredictions.  On  average, 
with  partial  inlining,  there  are  2.47  times  more  mispredictions  than 
native  execution,  while  there  are  only  1 .24  times  more  mispredic¬ 
tions  without  partial  inlining. 

Although  partial  inlining  has  significant  performance  penalties  by 
negatively  impacting  the  branch  predictor,  it  may  yield  more 
opportunities  for  dynamic  optimization.  As  baseline  Strata  per¬ 
forms  no  dynamic  optimizations,  it  is  impossible  to  overcome  the 
performance  loss  of  partial  inlining.  However,  any  work  striving  to 
gain  performance  based  on  the  increased  context  of  partial  inlining 
must  first  overcome  the  negative  impacts  of  using  partial  inlining 
before  perfonnance  gains  can  be  realized. 

For  our  work,  we  conclude  that  not  using  partial  inlining  is  the  best 
translation  policy,  and  all  remaining  experiments  do  not  use  it. 

3.5.2  Lazy  Translation  of  Call  Targets 

Since  the  target  of  a  direct  call  is  known  at  translation  time,  it  is 
possible  to  speculatively  translate  the  target  into  the  fragment 
cache  when  the  call  site  is  translated.  Speculatively  translating  the 
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Figure  4:  Performance  of  Strata  when  partial  inlining  is  used  compared  to  no  partial  inlining. 
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Figure  5:  Branch  mispredictions  with  and  without  partial  inlining. 

(Missing  benchmarks  are  a  result  of  third  party  tools  failing  to  inte¬ 
grate  properly  with  system  tools.) 

target  can  potentially  save  a  context  switch  back  to  the  SDT  sys¬ 
tem.  Unfortunately,  speculative  translation  can  also  increase  pres¬ 
sure  on  the  fragment  and  instruction  caches  and  waste  translation 
time.  An  alternative  is  lazy  call  translation  that  translates  the  call 
target  once  the  call  instruction  actually  executes.  Although  the 
final  code  that  is  materialized  is  the  same,  the  placement  of  the 
code  can  be  dramatically  different.  Figure  6  shows  how  speculative 
and  lazy  call  translation  affect  the  performance  of  applications  run 
with  Strata. 

The  figure  shows  that  translating  calls  speculatively  generally 
yields  little  performance  gain.  In  fact,  several  benchmarks  (lucas, 
art,  and  perlbmk)  have  significantly  worse  performance.  We 
believe  this  is  due  to  decreased  instruction  cache  locality.  To  con¬ 
firm  this  belief,  examine  results  from  PAPI  in  Figure  7.  The  figure 
shows  the  reduction  in  the  instruction  cache  misses  when  calls  are 


lazily  instead  of  speculatively  translated.  The  figure  indicates  that 
some  benchmarks  have  dramatically  reduced  instruction  cache 
miss  rates.  On  average,  over  14%  of  instruction  cache  misses  are 
eliminated.  Unfortunately,  the  reduction  in  misses  does  not  always 
translate  to  increased  performance,  as  some  benchmarks  are  not 
limited  by  instruction  cache  fetch  rates. 

Thus,  we  believe  that  call  target  should  be  lazily  translated,  and  all 
remaining  experiments  use  lazy  translation  for  call  targets. 

3.6  Fragment  Alignment 

Fragments  are  almost  always  the  destination  of  a  branch  instruc¬ 
tion.  By  forcing  the  fragment  to  start  on  a  cache  line  boundary,  or 
aligning  the  fragment,  the  processor  can  fetch  more  useful  instruc¬ 
tions  the  cycle  after  fetching  a  branch  that  is  predicted  taken.  How¬ 
ever,  forcing  the  alignment  of  fragments  also  leaves  a  useless  hole 
in  the  fragment  cache  between  the  two  fragments.  These  holes 
potentially  cause  extra  conflicts  in  the  instruction  cache  and 
increased  pressure  in  the  fragment  cache. 

To  better  understand  the  tradeoffs  between  using  fragment  align¬ 
ment  and  not  using  it,  consider  Figure  8.  The  figure  shows  the  per¬ 
formance  with  and  without  fragment  alignment.  The  benchmarks 
generally  show  an  improvement  with  fragment  alignment, 
although  some  have  no  gain  or  even  a  performance  loss.  The  most 
telling  indicator  that  the  fragment  alignment  is  beneficial  is  that  the 
integer  benchmarks  have  nearly  2%  improvement.  These  programs 
are  also  the  most  control-intensive,  where  alignment  is  most  help¬ 
ful.  Although  the  overall  benefit  it  slight,  we  believe  that  fragment 
alignment  is  generally  good  for  performance  and  should  be  used. 
Subsequent  experiments  use  fragment  alignment. 

3.7  Trampoline  Placement 

A  trampoline  is  a  portion  of  code  emitted  into  the  fragment  cache 
that  helps  perform  a  context  switch  and  transfer  control  from  the 
application  into  the  translator.  Trampolines  are  typically  small,  on 
the  order  of  10  instructions.  However,  any  CTI  whose  destination 
is  not  already  in  the  fragment  cache  needs  a  unique  trampoline. 
Consequently,  a  single  fragment  with  10  branches  may  need  10 
trampolines.  Initially  Strata  placed  the  trampolines  for  a  fragment 
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□  speculative  call  target  translation  □  lazy  call  target  translation 


Figure  6:  Results  comparing  lazy  call  target  translation  versus  speculative  translation. 


Figure  7:  Reduction  in  instruction  cache  misses  when  call  targets 
are  translated  lazily. 

into  the  fragment  cache  immediately  after  the  fragment.  However, 
since  most  trampolines  are  used  only  once  (once  a  trampoline  is 
used,  Strata  translates  the  requested  destination,  and  fixes  direct 
control  transfer  instructions  to  branch  directly  to  their  intended 
fragment  destination),  it  may  be  desirable  to  move  the  trampoline 
into  a  separate  area  of  the  fragment  cache.  Moving  trampolines  to 
their  own  area  keeps  the  frequently  used  application  code  closer 
together  in  the  fragment  cache.  This  can  potentially  reduce  the  con¬ 
flicts  in  the  instruction  cache. 

Figure  9  shows  the  results  of  placing  trampolines  in  a  separate  area 
of  the  fragment  cache,  called  the  “trampoline  pool”.  The  first  bar 
shows  the  performance  when  a  fragment’s  trampolines  are  placed 
immediately  after  a  fragment.  The  second  bar  shows  the  perfor¬ 
mance  when  trampolines  are  placed  in  the  pool.  As  the  figure 


shows,  most  applications  differ  little,  however,  the  applications 
with  higher  instruction  cache  pressure  do  show  some  improve¬ 
ment.  Figure  10  confirms  this  by  plotting  the  reduction  in  instruc¬ 
tion  cache  misses  when  a  trampoline  pool  is  used.  In  fact,  over 
18%  of  the  cache  misses  are  eliminated  by  moving  trampolines  to 
their  own  area.  Consequently,  we  believe  that  keeping  fragment 
trampolines  in  a  pool  is  worthwhile,  and  we  use  this  policy  for  the 
remainder  of  the  experiments. 

3.8  Cross  Platform  Verification 

All  previously  described  experiments  were  run  on  an  AMD  Athlon 
Opteron  244.  However,  the  construction  strategies  used  to  maxi¬ 
mize  performance  for  this  machine  work  well  for  other  machines. 
In  particular,  finite-sized  instruction  caches  and  hardware  return 
address  stacks  are  common  features  of  most  desktop  and  server 
machines.  Unfortunately,  fully  exploring  all  possible  design  com¬ 
binations  on  a  wide  variety  of  architectures  is  prohibitively  expen¬ 
sive.  Instead,  we  examine  the  performance  of  Strata’s  initial 
configuration  compared  to  the  configuration  optimized  for  the 
AMD  machine.  Figure  11  and  Figure  12  show  results  for  an  Ultra- 
spare  Hi1  and  an  Intel  Pentium  IV  Xeon2.  The  first  bar  in  each  fig¬ 
ure  shows  the  application  performance  for  Strata’s  initial 
configuration:  End  construction  at  a  conditional  CTI,  always  elide 
unconditional  CTIs,  always  use  partial  inlining,  speculative  trans¬ 
lation  of  call  instruction  targets,  do  not  align  branch  targets,  and 
keep  fragment  trampolines  with  the  fragment.  The  second  bar 
shows  the  application’s  performance  with  the  revised  configura¬ 
tion:  Never  end  construction  on  a  conditional  CTI,  emit  uncondi¬ 
tional  CTI’s  when  possible,  avoid  partial  inlining,  translate  call 


1.  The  SUNSpro  cc  compiler  is  used  with  optimization  level  -fast  - 
x05.  Dynamic  linking  is  enabled 

2.  The  GNU  gcc  compiler  is  used  with  optimization  level  -03  -fomit- 
frame-pointer.  Dynamic  linking  is  enabled 
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Figure  8:  Performance  of  applications  under  Strata  when  fragment  alignment  is  used. 


□  tramps  w  /frag  □  special  tramp  area 


Figure  9:  Performance  of  applications  under  Strata  with  fragment  trampolines  are  adjacent  to  fragments  versus  a  separate  trampoline  pool. 


sites  lazily,  align  fragments  on  instruction  cache  line  boundaries, 
and  place  trampolines  in  a  pool. 

The  figure  shows  that  the  optimized  fragment  building  techniques 
greatly  improve  the  performance  of  both  machines;  thus,  our  tech¬ 
niques  are  sound  across  a  variety  of  architectures.  Furthermore, 
machines  with  deeper  pipelines  and  higher  memory  latencies 
(compared  to  CPU  cycle  times)  will  benefit  more  from  these  pro¬ 
posed  translation  strategies.  This  is  particularly  true  because 
branch  mispredictions  and  cache  misses  have  a  higher  performance 
penalty  in  this  case. 

3.9  Remaining  Causes  of  Overhead 

By  choosing  the  fragment  building  strategies  outlined  in  previous 
sections,  the  overhead  that  Strata  introduces  to  native  applications 
has  been  greatly  reduced:  From  16%  to  3%  for  an  Opteron,  13%  to 
4%  for  an  UltraSparc,  and  28%  to  11%  for  a  Pentium  IV  Xeon.  In 
fact,  many  of  the  programs  have  little  or  no  overhead.  Some  pro¬ 


grams  stand  out  as  an  exception,  namely  gcc,  crafty,  perlbmk, 
mesa,  and  gap.  We  believe  the  overhead  for  these  benchmarks  are 
from  three  primary  causes:  Extra  instructions  executed,  extra 
instruction  cache  misses,  and  extra  branch  mispredictions. 

3.9.1  Extra  Instructions 

Figure  13  shows  the  instruction  count  of  the  benchmarks  when  run 
under  Strata  with  optimized  fragment  building  on  a  Pentium  IV 
Xeon.  As  the  figure  shows,  the  benchmarks  that  have  no  overhead 
also  do  execute  extra  instructions.  In  these  benchmarks,  most  of 
the  time  is  spent  executing  within  the  fragment  cache  and  the  trans¬ 
lation  time  is  minimal.  The  remaining  benchmarks  execute  5-20% 
more  instructions.  With  the  optimized  fragment  building  mecha¬ 
nisms,  Strata  only  adds  extra  instructions  to  the  fragment  cache 
when  translating  indirect  branches  or  other  infrequent  special 
cases.  Unfortunately,  as  shown  in  Figure  14,  the  poor  performing 
benchmarks  all  have  a  significant  number  of  (non-return)  indirect 
branches  executed.  Furthermore,  the  two  figures  together  show  the 


128 


0 

0 

0 

60% 

50% 

ii 

40% 

30% 

.c 

20% 

c 

o 

10% 

o 

=3 

0% 

■Q 

0 

-10% 

CC 

-20% 

n 

U  . 

 □ 

□ 

 □   

. □  it  n D 

Q.  X 
CD  0 
CD  tl 
O 


.y-  O 

N  ^ 
_Q  3 


Benchmark 


Figure  10:  Percent  reduction  in  instruction  cache  misses  with 
trampoline  pool. 


number  of  indirect  branches  is  related  to  the  number  of  extra 
instructions,  providing  strong  evidence  that  nearly  all  extra  instruc¬ 
tions  executed  come  from  handling  indirect  branches. 

3.9.2  Hardware  Interactions 

Although  it  is  interesting  to  note  that  the  extra  instructions  exe¬ 
cuted  are  related  to  indirect  branch  handling,  it  is  also  important  to 
understand  why  the  execution  time  of  benchmarks  with  significant 
overhead  is  not  proportional  to  the  number  of  indirect  branches 
executed  per  instruction.  We  believe  the  disproportionately  high 
overhead  (especially  on  the  Pentium  IV  machine)  of  mesa,  gcc, 
perlbmk  and  gap  are  related  to  extra  instruction  cache  pressure  and 
extra  branch  mispredictions.  Figure  15  provides  evidence  to  sup¬ 
port  this  theory.  The  graph  plots: 

instruction  cache  misses(native)  instruction  cache  missesj Strata) 
instructions(native)  instructions)  Strata) 
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Figure  11:  Performance  on  Strata  with  initial  configuration  and  optimized  configuration  for  UltraSparc  Ili. 


Figure  12:  Performance  of  Strata  with  initial  configuration  and  optimized  configuration  for  Pentium  IV. 
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4.  Related  Work 


■a 


Figure  13:  Instruction  count  of  benchmarks  with  Strata,  normal¬ 
ized  to  native  execution. 

In  essence,  this  fonnula  calculates  the  difference  in  instruction 
cache  misses  per  instruction  between  native  execution  and  Strata 
execution.  The  figure  also  plots  the  difference  in  branch  mispredic¬ 
tions  per  instruction  and  data  cache  misses  per  instruction.  The 
well-behaved  benchmarks  not  only  have  no  extra  instructions,  they 
also  show  very  few  extra  cache  misses  or  mispredictions  per 
instruction.  The  benchmarks  with  higher  overhead  show  a  different 
pattern.  Instead,  they  show  significant  increases  in  cache  misses 
and  branch  mispredictions,  accounting  for  the  significant  extra 
overhead  seen  in  these  benchmarks.  The  poorest  performing 
benchmark,  gap,  has  the  greatest  increase  in  branch  misprediction 
rates.  This  poor  branch  predictor  performance  is  because  the 
majority  of  call  instructions  in  gap  are  indirect  calls,  resulting  in 
extremely  poor  use  of  the  hardware  return  address  stack  when 
Strata  is  performing  dynamic  translation. 

Also,  in  deeper  pipelines  and  as  pressure  on  caches  increases,  these 
effects  are  more  pronounced.  This  is  evidenced  by  the  fact  that  the 
overhead  is  higher  for  the  30-stage  pipeline  of  the  Pentium  IV  than 
for  the  15-stage  pipeline  in  the  AMD  Athlon  Opteron. 


Smith  and  Nair  describe  dynamic  basic  blocks,  which  are  the  most 
straightforward  method  of  handling  code  cache  layout  [18].  With 
dynamic  basic  blocks,  the  dynamic  translator  continues  code  lay¬ 
out  until  it  reaches  a  branch  instruction.  At  that  point  a  trampoline 
is  inserted  that  returns  control  of  the  application  back  to  the 
dynamic  translator.  Strata's  original  fragment  layout  scheme 
behaved  similarly  to  dynamic  basic  block  creation  [14,  17].  How¬ 
ever,  it  would  terminate  a  fragment  after  every  conditional  or  indi¬ 
rect  branch,  but  continue  decoding  unconditional  branches  and 
calls,  eliding  them.  HDTrans  performs  code  layout  slightly  differ¬ 
ently  [19].  They  elide  unconditional  branches,  but  they  continue 
translating  the  fall  through  instructions  after  direct  calls.  This  cre¬ 
ates  basic  blocks  that  more  closely  match  original  layout  of  the 
program  text. 

More  advanced  tracing  is  implemented  in  DynamoRIO  [3]  and  Pin 
[13].  DynamoRIO  creates  a  sharp  divide  between  its  basic  block 
cache  and  its  trace  cache.  DynamoRIO  selects  traces  to  optimize- 
based  on  next  executing  tail,  or  NET  [8],  Under  this  scheme,  once 
a  trace  head  becomes  hot,  the  trace  is  created  by  following  the  exe¬ 
cution  of  the  path  tail,  assuming  that  the  path  taken  is  a  frequently 
executed  path.  Alternately,  Pin  builds  traces  without  first  profiling 
them  as  frequently  executed.  Pin  ends  traces  on  unconditional 
branches,  or  if  it  has  translated  too  many  conditional  branches  or 
too  many  total  instructions.  Recently,  Pin  has  been  extended  to  use 
the  last-executed  iteration  (LEI)  to  better  select  traces  for  optimi¬ 
zation  [10].  Table  1  summarizes  how  different  translators  end  frag¬ 
ments. 

All  of  these  code  cache  layout  implementations  have  intuitive 
arguments  for  their  benefits,  however  each  must  translate  and  exe¬ 
cute  fragments  (at  least  until  hot  paths  have  been  selected  for  opti¬ 
mization).  Unfortunately,  no  previous  studies  have  quantified  the 
effects  of  the  initial  fragment  construction  techniques  have  on  SDT 
performance.  The  work  presented  here  differs  from  previous  work 
in  that  it  attempts  to  quantify  and  select  the  best  fragment  construc¬ 
tion  methods  for  overhead  reduction. 


o 

o 

<D 

CO 

a3 

Q. 


a> 

jz 

o 


cn 


«  30 
!  25 
2  20 
15 
10 


□  ,  □ 


.□ . y . H 


F  3  3 

c  i- 


;o 

03 

0 

o 

CL 

(/) 

■o 

-x 

i_ 

U) 

Q. 

c n 

0 

_D) 

03 

-X 

03 

0 

i_ 

E 

C- 

S 

CO 

03 

o 

03 

E 

03 

E 

03 

CD 

d 

CT 

0 

O 

C 

03 

d 

E 

s_ 

"x 

0 

£ 

(/) 

Q_ 

03 


.Q. 

N 

CD 


O 

O 

CD 


C 

Q. 

X 

CN 

M— 

0 

8 

L_ 

03 

Q. 

o 

E 

03 

0 

.9- 

o 

> 

E 

o 

0 

-9 

0 

Q- 

CD 

-c 

o 

> 

N 

-Q 

3 

03 

S 

TO 


□  calls  ■  sw  itches  □  returns 


Figure  14:  Indirect  branches  executed  per  second  under  native  execution  on  AMD  Athlon. 
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Figure  15:  Difference  in  instruction  cache  misses  and  branch 
mispredictions  per  instruction  between  native  execution  and 
execution  under  Strata. 


Table  1:  Summary  fragment  tennination  conditions  in  different 
dynamic  translators. 


Translator 

Uncond. 

CTI 

Handling 

Cond. 

CTI 

Handling 

Direct  Call 
Handling 

Max 

Insns 

Original 

Strata 

Always 

elide 

Always 

end 

Always 

inline 

Never 

end 

Opt.  Strata 

Elide  if  not 
exists 

Always 

continue 

Never  inline 

4096 

HDTrans 

Elide  if  not 
exists 

End  if  targ 
exists 

Never  inline 

DynamoRIO 

Always 

elide 

Always 

end 

Always 

inline 

End  if 

max 

insns 

Pin 

Always 

elide 

Always 

continue21 

Always 

inline 

End  if 

max 

insns 

a.  Pin  terminates  a  trace  after  a  “pre-deftned  number  of  condi¬ 
tional  control  transfers”  [13]. 


5.  Conclusions 

Software  dynamic  translation  (SDT)  has  been  used  extensively  in 
the  past  and  has  a  wide  variety  of  interesting  opportunities  for 
future  use.  In  order  for  SDT  to  be  more  pervasively  used,  SDT  sys¬ 
tems  must  add  as  little  extra  processor  time,  memory  usage,  disk 
usage  and  network  traffic  as  possible.  This  paper  has  examined 
methods  for  fragment  creation  in  SDT  systems  and  evaluated  their 
effectiveness  within  the  Strata  SDT  framework.  We  have  found 
that  fragment  creation  should  not  end  at  conditional  branches,  and 
should  end  only  at  unconditional  branches  when  the  target  frag¬ 
ment  already  exists  within  the  fragment  cache.  We  have  further 
found  that  translating  call  instructions  using  a  partial  inlining  tech¬ 
nique  negatively  affects  the  hardware  branch  predictor  and  lazily 
translating  call  site  targets  improves  instruction  cache  locality. 


Instruction  cache  locality  is  also  improved  by  placing  fragment 
trampolines  into  a  special  area  of  the  fragment  cache  that  is  disjoint 
from  the  application’s  instructions.  In  addition  to  evaluating  the 
fragment  building  techniques  on  an  AMD  Athlon  Opteron  244,  the 
paper  further  validates  the  findings  on  two  other  machines,  an  Intel 
Pentium  IV  Xeon  and  a  Sun  UltraSparc  Hi.  We  find  that  overhead 
can  be  reduced  from  16%,  13%,  and  28%  to  3%,  4%,  and  11%  for 
these  Opteron,  UltraSparc,  and  Xeon,  respectively  on  the  entire 
SPEC2000  benchmark  suite.  Lastly,  we  examine  the  remaining 
causes  of  overhead  and  find  that  it  is  directly  related  to  the  han¬ 
dling  of  indirect  branches,  partially  due  to  executing  extra  instruc¬ 
tions,  but  also  due  to  decreased  hardware  performance  on  the 
transformed  program. 
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